Transforms data (matching filenames) into information (time capsules)
People value information, not data.
Netflix Prize
In 2006, Netflix offered a $1M prize for an algorithm that predicted users’ film ratings from their previous ratings better than Netflix’s own. Contestants worked with 100M ratings like these:
| movie_id | user_id | rating | date       |
|----------|---------|--------|------------|
| 1        | 1488844 | 3      | 2005-09-06 |
| 1        | 822109  | 5      | 2005-05-13 |
| 1        | 885013  | 4      | 2005-10-19 |
In 2009, Netflix awarded the prize to a team whose predictions beat Netflix’s initial algorithm by 10%, measured by prediction error (RMSE).
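To make the 10% figure concrete: the contest scored entries by root-mean-squared error (RMSE) between predicted and actual ratings. The sketch below uses the three example ratings and two sets of predictions that are invented purely for illustration (they are not output from Netflix or any contestant), chosen so the improvement works out to 10%.

```r
# The three example ratings shown on this slide.
ratings <- data.frame(
  movie_id = c(1, 1, 1),
  user_id  = c(1488844, 822109, 885013),
  rating   = c(3, 5, 4),
  date     = as.Date(c("2005-09-06", "2005-05-13", "2005-10-19"))
)

# Hypothetical predictions from two made-up models (illustration only).
baseline_pred <- c(4.0, 4.0, 4.0)  # always guess 4 stars
improved_pred <- c(3.9, 4.1, 4.0)  # nudged slightly toward the true ratings

# Root-mean-squared error: the typical size of a prediction error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

baseline_rmse <- rmse(ratings$rating, baseline_pred)
improved_rmse <- rmse(ratings$rating, improved_pred)

# Relative improvement of the second model over the first.
improvement <- (baseline_rmse - improved_rmse) / baseline_rmse
round(improvement, 2)
```

A 10% improvement means the typical error shrinks by a tenth, not that 10% more ratings are predicted exactly.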
Inside Airbnb
Does Airbnb contribute to housing scarcity and high rents?
Are people renting spare rooms (“sharing economy”) or whole homes?
“Inside Airbnb” aligns Airbnb data with local housing maps
Returning entire home short-term rentals from lodging to the housing market would make 16% more rental housing units available across Dallas and up to 62% more in some Council Districts.
College Scorecard
In 2015, the US Department of Education released the College Scorecard
Scorecard tacitly assumes that the most important thing about education is return-on-investment
The Opportunity Atlas
Raj Chetty et al. use Census data to study economic mobility based on where children grew up. Richmond is exemplary of a trend they find nationally:
children’s outcomes vary sharply across nearby tracts: for children of parents at the 25th percentile of the income distribution, the standard deviation of mean household income at age 35 is $5,000 across tracts within counties.
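The statistic in that quote, a standard deviation across tracts within counties, can be sketched in a few lines of R. The counties, tracts, and incomes below are hypothetical values invented for illustration, and the published estimate may involve weighting or other adjustments that this sketch leaves out.

```r
# Hypothetical tract-level data: mean household income at age 35 for children
# whose parents were at the 25th income percentile. All values are made up.
tracts <- data.frame(
  county         = c("A", "A", "A", "B", "B", "B"),
  tract          = c("A1", "A2", "A3", "B1", "B2", "B3"),
  mean_income_35 = c(28000, 33000, 38000, 25000, 35000, 45000)
)

# Standard deviation of tract means, computed separately within each county.
# In dplyr this would be group_by(county) followed by summarise() with sd().
within_county_sd <- aggregate(mean_income_35 ~ county, data = tracts, FUN = sd)
within_county_sd
```

A larger value for a county means that neighboring tracts in it differ more in children’s later incomes.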
We will create, organize, explore, visualize, and model different kinds of data, extending DSST289.
We will create professional-quality research reports and slides using scientific publishing software.
We will work with many data types this semester, but will emphasize data derived from and metadata describing texts.
What isn’t Advanced Data Science?
Not about causal inference (i.e., cause-and-effect)
DSST310: Causal Inference
Not primarily about machine learning
DSST312: Predictive Models
Not primarily about regression
DSST331: Regression Theory and Applications
What tools will we use?
R
tidyverse
Other R packages
RStudio
Quarto
Generative AI
What is “the whole game”?
flowchart LR
subgraph Get_Data [Get Data]
direction LR
A[Import]
G[Create]
end
A --> B[Tidy]
G --> B[Tidy]
B --> C[Transform]
subgraph Understand
direction LR
C --> D[Visualize]
D --> E[Model]
E --> C
end
Understand --> F[Communicate]
subgraph "The Whole Game"
direction LR
Get_Data
B
Understand
F
end
DSST289 emphases
flowchart LR
subgraph Get_Data [Get Data]
direction LR
A[Import]
G[Create]
end
A --> B[Tidy]
G --> B[Tidy]
B --> C[Transform]
subgraph Understand
direction LR
C --> D[Visualize]
D --> E[Model]
E --> C
end
Understand --> F[Communicate]
subgraph "The Whole Game"
direction LR
Get_Data
B
Understand
F
end
style A fill:#E69F00,stroke:#000,stroke-width:2
style C fill:#E69F00,stroke:#000,stroke-width:2
style D fill:#E69F00,stroke:#000,stroke-width:2
style F fill:#E69F00,stroke:#000,stroke-width:2
DSST389 emphases
flowchart LR
subgraph Get_Data [Get Data]
direction LR
A[Import]
G[Create]
end
A --> B[Tidy]
G --> B[Tidy]
B --> C[Transform]
subgraph Understand
direction LR
C --> D[Visualize]
D --> E[Model]
E --> C
end
Understand --> F[Communicate]
subgraph "The Whole Game"
direction LR
Get_Data
B
Understand
F
end
style G fill:#56B4E9,stroke:#000,stroke-width:2
style B fill:#56B4E9,stroke:#000,stroke-width:2
style E fill:#56B4E9,stroke:#000,stroke-width:2
style F fill:#56B4E9,stroke:#000,stroke-width:2